About the project


The first exercise

Introduction to Open Data Science is an exciting course which helps us to understand how to efficiently use tools such as GitHub, R and RStudio. My personal repository can be found here.



Data analysis for IODS week 2

The data were collected between 3.12.2014 and 10.1.2015 and have been read into R. The original results come from an international survey of Approaches to Learning. The data include variables as described here. The summary variables (such as Deep_adj) address questions like how deeply the individual is trying to learn and how organized their studying is.

Next, I will briefly examine the data set’s dimensions and structure.

# First six rows of the data set
head(students2014)
##   gender Age Attitude Points Deep_adj Surf_adj Stra_adj
## 1      F  53       37     25 3.583333 2.583333    3.375
## 2      M  55       31     12 2.916667 3.166667    2.750
## 3      F  49       25     24 3.500000 2.250000    3.625
## 4      M  53       35     10 3.500000 2.250000    3.125
## 5      M  49       37     22 3.666667 2.833333    3.625
## 6      F  38       38     21 4.750000 2.416667    3.625
# The dimensions
dim(students2014)
## [1] 166   7
# Type of variables
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ Deep_adj: num  3.58 2.92 3.5 3.5 3.67 ...
##  $ Surf_adj: num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Stra_adj: num  3.38 2.75 3.62 3.12 3.62 ...
# Summaries of the variables included in the data
summary(students2014)
##  gender       Age           Attitude         Points         Deep_adj    
##  F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00   Min.   :1.583  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   1st Qu.:3.333  
##          Median :22.00   Median :32.00   Median :23.00   Median :3.667  
##          Mean   :25.51   Mean   :31.43   Mean   :22.72   Mean   :3.680  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75   3rd Qu.:4.083  
##          Max.   :55.00   Max.   :50.00   Max.   :33.00   Max.   :4.917  
##     Surf_adj        Stra_adj    
##  Min.   :1.583   Min.   :1.250  
##  1st Qu.:2.417   1st Qu.:2.625  
##  Median :2.833   Median :3.188  
##  Mean   :2.787   Mean   :3.121  
##  3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.333   Max.   :5.000

From the above we can conclude that the data has 166 observations of 7 variables, as the data frame has 166 rows and 7 columns. The data consists mainly of numerical information on the respondents.

Of the respondents, 110 identified themselves as female and 56 as male. Respondents were 17–55 years old and half of them were 21–27 years old. They scored an average of 22.72 points on the final exam and were somewhat deep learners (\(\mu\) = 3.68). A one-sample t-test for the mean of Stra_adj did not show a significant difference from 3 (\(p \approx 0.446\)), so the evidence that the respondents tend to be strategic learners is weak. The t-test is a classical statistical test of whether the mean of a group of measurements differs from a given value by more than chance alone.
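The one-sample t-test mentioned above can be run as follows (a sketch; assumes the students2014 data frame is loaded):

```r
# One-sample t-test: does the mean of Stra_adj differ from 3?
t.test(students2014$Stra_adj, mu = 3)
```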

The sum variables (Points, Deep_adj, Surf_adj, Stra_adj) were created as instructed by Kimmo.

Graphical overview

First, we will construct pairwise plots of each variable and draw bigger histograms to see the variables’ distributions.
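The plot matrix can be drawn e.g. with GGally (a sketch; colouring by gender is my choice here):

```r
library(ggplot2)
library(GGally)

# Plot matrix: distributions on the diagonal, pairwise correlations above,
# scatter plots below, coloured by gender
ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3))
```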

Summaries were printed in the previous section, but here we clearly see how the distribution of ages is heavily skewed to the right, whereas the other variables have approximately symmetric distributions. From the plot matrix we can conclude that none of the variables correlate strongly with each other: the highest absolute correlation, 0.437 between Points and Attitude, would be classified as a weak-to-moderate positive correlation.

From the lower matrix it is also visible that males seem to have a somewhat better attitude toward statistics and that women tend to be more methodical in their approach to studying.

Regression

Let’s perform a linear regression, where we assume a linear dependency between the dependent variable and the explanatory variables. The linear model can be expressed as \[y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\] where \(x_1, x_2, x_3\) are the values of the explanatory variables and \(y\) is the value of the dependent variable. Epsilon is the error term (the residuals), which should be normally distributed, \(\epsilon \sim N(0,\sigma^2)\). The residuals are examined via the diagnostic plots.
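The model summarised in the output below was fitted with lm() (the object name model is my choice):

```r
# Regress exam points on attitude, age and strategic learning
model <- lm(Points ~ Attitude + Age + Stra_adj, data = students2014)
summary(model)
```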

## 
## Call:
## lm(formula = Points ~ Attitude + Age + Stra_adj, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## Attitude     0.34808    0.05622   6.191 4.72e-09 ***
## Age         -0.08822    0.05302  -1.664   0.0981 .  
## Stra_adj     1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

From the summary it is visible that only Attitude affects performance in the final exam at the 5 per cent significance level: each additional attitude point is estimated to yield an average of 0.35 extra exam points. We also notice from the output that Age has the least significant effect in our model, so we remove it.

# Remove the variable Age with the update() function
model <- update(model, formula. = .~. - Age)

summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude + Stra_adj, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## Stra_adj     0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09


From the summary of the model we see that one could expect approximately 9 points on the final exam with zero attitude and no strategy in learning. Attitude has a highly significant effect, which stays at the same level as in the previous model. Each p-value tests the null hypothesis that the corresponding coefficient (\(\alpha\) or \(\beta_i\) in the equation above) is zero. As is seen, for every additional point of attitude the score on the final exam rises by approximately 0.35 points.

The R-squared of the output tells us how well our model explains the variability of the dependent variable. In this case, our model explains roughly 20.5 per cent of the variability, which is rather low. According to some sources the \(R^2\) should be around 0.40–0.60.

Diagnostics

Let’s produce the diagnostic plots:
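A sketch of producing the diagnostic plots discussed below (which = c(1, 2, 5) selects Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage):

```r
par(mfrow = c(2, 2))
# Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage
plot(model, which = c(1, 2, 5))
```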

As explained previously, the linear model relies on the residuals being normally distributed. The Q-Q plot tells us that the assumption is slightly off: the residuals’ left tail seems to be fine, as only 4 observations lie off the quantile-quantile line, whilst the right tail is slightly heavy. When a distribution’s right tail is “heavy”, it has more mass there than the model predicts. The other plots also show no significant deviation from the assumption that the residuals do not depend on the fitted values.


IODS week 3

Glimpse of the data

Full information can be found here.

## Column names:
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data set includes information about students participating in mathematics and Portuguese classes. As described on the website: “The data attributes include student grades, demographic, social and school related features and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).”

The data has information on 382 individuals across 35 variables. Of the respondents, 198 identified as female and 184 as male. The mean age was 16.8 years.

Studying relationships with alcohol consumption

I chose to model alcohol consumption, explaining it with nursery school attendance, family size (famsize), quality of family relationships (famrel) and absences. I hypothesize that

  • attending nursery school lowers alcohol use
  • the bigger the family, the greater the chance for increased use of alcohol (because of non-existent or minor supervision from parents)
  • better family relationships account for lower use of alcohol
  • more absences are a clear indicator that something is wrong.

Nursery

Let’s cross tabulate nursery and alcohol use and take the columns’ proportional percentages:
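A sketch of how the cross tabulation and the \(\chi^2\)-test can be computed (assumes the alc data frame from the data wrangling step):

```r
# Column-wise proportions of nursery attendance within each high_use group
tab <- table(nursery = alc$nursery, high_use = alc$high_use)
prop.table(tab, margin = 2)

# Test whether the difference in proportions is significant
chisq.test(tab)
```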

Nursery / high use
       FALSE   TRUE
no     0.172   0.228
yes    0.828   0.772

We see that a slightly larger proportion of those who had not attended nursery school are in fact “high users”. A \(\chi^2\)-test verified that the difference is not significant (\(p \approx 0.25\)).

Family size

Let’s do a similar cross tabulation:

Family size / high use
       FALSE   TRUE
GT3    0.75    0.675
LE3    0.25    0.325

It is seen that the proportions are quite similar, though greater alcohol consumption seems to be associated with living in a smaller family.

Quality of family’s relations

Let’s plot the proportions:

The bar plot supports our hypothesis, as high users seem to have worse family relations.

Absences

The plots, yet again, support our hypothesis: high users seem to have more absences from school.

Logistic regression

Next, we will fit a logistic regression model to explain high alcohol use.
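The model in the output below was fitted with glm() (the object name m is my choice):

```r
# Logistic regression: high alcohol use explained by nursery attendance,
# family size, quality of family relations and absences
m <- glm(high_use ~ nursery + famsize + famrel + absences,
         data = alc, family = "binomial")
summary(m)
```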

## 
## Call:
## glm(formula = high_use ~ nursery + famsize + famrel + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4516  -0.8364  -0.6850   1.1891   1.9524  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.09520    0.55780  -0.171 0.864485    
## nurseryyes  -0.46506    0.28932  -1.607 0.107967    
## famsizeLE3   0.39087    0.25613   1.526 0.126991    
## famrel      -0.23692    0.12408  -1.909 0.056211 .  
## absences     0.08883    0.02320   3.830 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 439.36  on 377  degrees of freedom
## AIC: 449.36
## 
## Number of Fisher Scoring iterations: 4

As seen from the summary above, the number of absences is significantly associated with high alcohol use: as absences increase, high use becomes more probable. The quality of family relations is almost significant (\(p \approx 0.056\)). Let’s print the odds ratios:
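The odds ratios and their confidence intervals below can be computed by exponentiating the coefficients and their intervals (a sketch, assuming the fitted model object m):

```r
# Odds ratios and 95% confidence intervals on the odds scale
OR <- exp(coef(m))
CI <- exp(confint(m))
cbind(Coef = OR, CI)
```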

Regression model estimates with 95% CIs
              Coef    2.5 %   97.5 %
(Intercept)   0.909   0.301   2.706
nurseryyes    0.628   0.358   1.117
famsizeLE3    1.478   0.891   2.437
famrel        0.789   0.618   1.007
absences      1.093   1.046   1.146

In a statistical sense, the only inference that can be made is that absences affect alcohol use (or rather that they are associated; correlation doesn’t imply causality), which supports my original hypothesis. If we look strictly at the estimates, nursery school attendance and family relations are in line with my original hypothesis; the effect of family size is not.

Exploring the predictive power of my model
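A sketch of how the confusion matrix and the misclassification rate below can be computed (the column names probability and prediction are my own):

```r
# Predicted probabilities of high use and a 0.5 classification threshold
alc$probability <- predict(m, type = "response")
alc$prediction <- alc$probability > 0.5

# Cross tabulate actual vs predicted classes
table(high_use_actual = alc$high_use, prediction = alc$prediction)

# Total proportion of inaccurately classified individuals
mean(alc$high_use != alc$prediction)
```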

##                prediction
## high_use_actual FALSE TRUE
##           FALSE   258   10
##           TRUE     97   17

## Total proportion of inaccurately classified individuals:
## [1] 0.2801047

Comments:

  • The model predicted the FALSE cases quite well.
  • The TRUE cases were more problematic: of the 114 true high users, only 17 were correctly identified.
  • If we had just guessed the outcomes by, say, flipping a coin, we would have achieved significantly worse results. To confirm this, a binomial test with \(258+17=275\) successes in \(258+10+97+17=382\) trials and an underlying probability of 0.5 yields \(p \leq 2.2\cdot10^{-16}\). This means that with a fair coin (or arbitrary guesses) there would have been no way to be even this accurate.

Cross validation
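The cross-validation can be sketched as follows (assumes the boot package and the fitted model m; the loss function counts misclassifications):

```r
library(boot)

# Loss: mean proportion of incorrect 0/1 classifications
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

# 10-fold cross-validation of the fitted model
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```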

## [1] 0.2801047
## [1] 0.2801047

My model has around a 28 per cent error rate, so it is worse than the one in DataCamp, but usually better than the one given by the cross-validation function.

Super-über bonus task

Let’s find the best model. As there are 33 usable variables for the analysis, there are \(2^{33}-1 = 8{,}589{,}934{,}591\) possible ways to choose 1–33 of them.

# The order of the variables doesn't matter, so combinations suffice:
# allCombinations <- gtools::combinations(33, 4)
#
# Full model with all usable predictors (alcohol-use variables excluded):
# model_x <- glm(high_use ~ . - alc_use - probability - prediction - Walc - Dalc,
#                data = alc, family = "binomial")
#
# cv_errors <- numeric(nrow(allCombinations))
# for (index in 1:nrow(allCombinations)) {
#   model_xs <- glm(alc$high_use ~ alc[, allCombinations[index, 1]] +
#                     alc[, allCombinations[index, 2]] +
#                     alc[, allCombinations[index, 3]] +
#                     alc[, allCombinations[index, 4]],
#                   family = "binomial")
#   cv_x <- cv.glm(data = alc, cost = loss_func, glmfit = model_xs, K = 5)
#   cv_errors[index] <- cv_x$delta[1]
# }
#
# cv_x <- cv.glm(data = alc, cost = loss_func, glmfit = model_x, K = 10)
# cv_x$delta[1]

# Significant predictors found: father's occupation, schoolsup and higher

IODS week 4: Clustering

Let’s load the data and look at its dimensions and structure:
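The data set is the Boston housing data shipped with the MASS package, so loading and inspecting it can be sketched as:

```r
library(MASS)   # the Boston data set ships with MASS

data("Boston")
str(Boston)
dim(Boston)
```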

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506  14

So it appears that the data has information on 14 variables from 506 individual observations.

library(ggplot2)
library(GGally)

ggpairs(Boston[,!colnames(Boston) %in% c('chas', 'rad')])

# Let's print the summaries without chas and rad, which are categorical.

summary(Boston[,!colnames(Boston) %in% c('chas', 'rad')])
##       crim                zn             indus            nox        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.3850  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.4490  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.5380  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.5547  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :0.8710  
##        rm             age              dis              tax       
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   :187.0  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.:279.0  
##  Median :6.208   Median : 77.50   Median : 3.207   Median :330.0  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   :408.2  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:666.0  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :711.0  
##     ptratio          black            lstat            medv      
##  Min.   :12.60   Min.   :  0.32   Min.   : 1.73   Min.   : 5.00  
##  1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95   1st Qu.:17.02  
##  Median :19.05   Median :391.44   Median :11.36   Median :21.20  
##  Mean   :18.46   Mean   :356.67   Mean   :12.65   Mean   :22.53  
##  3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :22.00   Max.   :396.90   Max.   :37.97   Max.   :50.00
table(Boston$chas)
## 
##   0   1 
## 471  35
table(Boston$rad)
## 
##   1   2   3   4   5   6   7   8  24 
##  20  24  38 110 115  26  17  24 132

Interesting relationships.

boston_s <- as.data.frame(scale(Boston))

summary(boston_s)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# Cut the scaled crime rate into a categorical variable using its quantiles
boston_s$crime <- cut(boston_s$crim, breaks = quantile(boston_s$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# Drop the original continuous crime rate
boston_s <- dplyr::select(boston_s, -crim)

# Random 80/20 train-test split
selection <- sample(x = 1:nrow(boston_s), size = round(0.8*nrow(boston_s)))

train <- boston_s[selection, ]

test <- boston_s[-selection, ]

# Fit LDA with the crime category as the target
lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes , pch = classes)
lda.arrows(lda.fit, myscale = 1)

# Save the correct classes and remove them from the test data
crimecat_actual <- test$crime
test <- dplyr::select(test, -crime)

# Predict the crime categories for the test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = crimecat_actual, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       7        0    0
##   med_low    4      13        8    0
##   med_high   0       8       15    1
##   high       0       0        0   29

Interesting: most predictions fall on the diagonal, and the high category is classified almost perfectly.

# k-means clustering
set.seed(777)

bst <- as.data.frame(scale(Boston))

km_5 <- kmeans(bst, centers = 5)

lda.fit2 <- lda(km_5$cluster ~ ., data = bst)

# target classes as numeric
class <- as.numeric(km_5$cluster)

# plot the lda results
plot(lda.fit2, dimen = 2, col = class , pch = class)
lda.arrows(lda.fit2, myscale = 1)

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 405  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
#plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_5$cluster)